A Faster Structured-Tag Word-Classification Method
نویسنده
چکیده
Several methods have been proposed for processing a corpus to induce a tagset for the sublanguage represented by the corpus. This paper examines a structured-tag word classification method introduced by McMahon (1994). Two major variations (non-random initial assignment of words to classes and moving multiple words in parallel) together provide robust non-random results with a speed increase of 200% to 450%, at the cost of slightly lower quality than McMahon’s method’s average quality. Two further variations, retaining information from lessfrequent words and avoiding reclustering closed classes, are proposed for further study.
منابع مشابه
An Improved Tag Dictionary for Faster Part-of-Speech Tagging
Ratnaparkhi (1996) introduced a method of inferring a tag dictionary from annotated data to speed up part-of-speech tagging by limiting the set of possible tags for each word. While Ratnaparkhi’s tag dictionary makes tagging faster but less accurate, an alternative tag dictionary that we recently proposed (Moore, 2014) makes tagging as fast as with Ratnaparkhi’s tag dictionary, but with no decr...
متن کاملارائه روشی برای استخراج کلمات کلیدی و وزندهی کلمات برای بهبود طبقهبندی متون فارسی
Due to ever-increasing information expansion and existing huge amount of unstructured documents, usage of keywords plays a very important role in information retrieval. Because of a manually-extraction of keywords faces various challenges, their automated extraction seems inevitable. In this research, it has been tried to use a thesaurus, (a structured word-net) to automatically extract them. A...
متن کاملTag-Weighted Topic Model For Large-scale Semi-Structured Documents
To date, there have been massive Semi-Structured Documents (SSDs) during the evolution of the Internet. These SSDs contain both unstructured features (e.g., plain text) and metadata (e.g., tags). Most previous works focused on modeling the unstructured text, and recently, some other methods have been proposed to model the unstructured text with specific tags. To build a general model for SSDs r...
متن کاملTag-Weighted Topic Model for Mining Semi-Structured Documents
In the last decade, latent Dirichlet allocation (LDA) successfully discovers the statistical distribution of the topics over a unstructured text corpus. Meanwhile, more and more document data come up with rich human-provided tag information during the evolution of the Internet, which called semistructured data. The semi-structured data contain both unstructured data (e.g., plain text) and metad...
متن کاملA New Document Embedding Method for News Classification
Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره cmp-lg/9610004 شماره
صفحات -
تاریخ انتشار 1996